ABSTRACT

A many-faceted Rasch model (FACETS) is presented for
the measurement of writing ability. The FACETS model is a 
multivariate extension of Rasch measurement models that can be used 
to provide a framework for calibrating both raters and writing tasks 
within the context of writing assessment. A FACETS model is described 
based on the current procedures of the Georgia Basic Skills Writing 
Test. These procedures can be seen as a prototype for other statewide 
assessments of writing. A small data set produced by 15 eighth grade 
students from this assessment is analyzed with the FACETS computer
program of J. M. Linacre (1989). The FACETS model offers a promising
approach for solving a variety of measurement problems in the 
statewide assessment of writing ability. It provides a framework for 
obtaining objective linear measurements of writing ability that 
generalize beyond specific raters and writing tasks. The FACETS model
can also be applied to the assessment of writing ability based on 
holistic scoring and exploration of issues related to bias. A 37-item
list of references, 5 tables, and 2 figures are included. (SLD)

Abstract

The purpose of this study is to describe a Many-Faceted Rasch
(FACETS) model for the measurement of writing ability. The FACETS
model is a multivariate extension of Rasch measurement models that 
can be used to provide a framework for calibrating both raters and 
writing tasks within the context of writing assessment. The use of 
the FACETS model for solving measurement problems encountered in 
the assessment of writing ability is presented here. A small data 
set from a statewide assessment of writing ability is used to 
illustrate the FACETS model.

THE MEASUREMENT OF WRITING ABILITY WITH A MANY-FACETED RASCH MODEL

Direct assessments of student writing ability are currently
being conducted or planned in almost every state (Afflerbach,
1985). These statewide writing assessments are generally
high-stakes tests for examinees with direct consequences for
instructional placement, grade-to-grade promotion and high school
graduation. National assessments of writing ability (Applebee,
Langer, & Mullis, 1985; Applebee, Langer, Jenkins, Mullis &
Foertsch, 1990), as well as international assessments (Gorman,
Purves & Degenhart, 1988), have also been conducted using essays
written by students.

In spite of the increase in direct assessments of writing 
ability, relatively little is known about the validity of current 
measurement procedures for estimating writing ability. The 
objective assessment of writing ability based on student essays 
presents a variety of measurement problems that are difficult to 
address within the framework of current test theories that are
primarily designed to model dichotomous data from multiple-choice
items.

The first problem is that most of the common scoring
procedures for essays are based on non-dichotomous ratings, such as
the traditional Likert-type scales; this is the case whether
holistic (Cooper, 1977) or some form of analytic scoring is used
(Lloyd-Jones, 1977). Recent work on psychometric models for this
type of data has contributed to our understanding of rating scales
(Wright & Masters, 1982), and some of these models have been used
to analyze student essays (Pollitt & Hutchinson, 1987) . A second 
problem is that the ratings of the essays are made by raters who 
introduce a source of variation into the measurement process that 
is not found in multiple-choice tests. Several studies have 
suggested that in spite of thorough training raters still vary in 
severity (Lunz, Wright, and Linacre, 1990) and inter-rater
reliability remains a significant problem (Braun, 1988; Cohen, 
1960) . As pointed out by Coffman (1971) in his review of the 
literature, one of the major problems with essay examinations is 
that when different raters are asked to rate the same essay they 
tend to disagree in their ratings. A third problem encountered 
within the context of statewide assessments of writing ability is 
how to adjust for differences in writing task difficulty when
students respond to different writing tasks. There is substantial 
evidence that writing tasks do differ in difficulty (Ruth & Murphy, 
1988). 

These measurement problems led earlier psychometricians to a
Procrustean approach to writing assessment based on multiple-choice 
items. These indirect assessments led to reliable estimates of 
writing ability based on standard criteria used with traditional 
test theory for multiple-choice items. Although there is some 
evidence that different traits were being measured as a function 
of test format (Ackerman & Smith, 1988) , indirect assessments of 
writing ability tend to be highly correlated with ratings based on
actual writing samples. Indirect assessments of writing ability
seem to work well when the major goal of the assessment is simply
to rank order students, but they do not encourage the teaching and
learning of writing. This well-known connection between
procedures and teaching has provided the motivation for increased 
use of authentic and performance-based measurement of writing, as 
well as other competencies. 

It is beyond the scope of this paper to provide a detailed 
survey of other psychometric models that have been proposed for
direct assessments of writing. Briefly, these models can be
grouped into two major approaches, one based on analysis of
variance models and the other on linear structural equation models.
Examples of approaches to writing assessment based on analysis of
variance models are the early work of Stanley (196 .) and the
research of Braun (1988) on the calibration of essay raters.
Generalizability theory (Cronbach, Gleser, Nanda & Rajaratnam,
1972) has also been used to examine essay data by several
researchers (Bunch & Littlefair, 1988; Lane & Sabers, 1989). Blok
(1985) and Ackerman & Smith (1988) present examples of how linear
structural equation models using LISREL (Joreskog & Sorbom, 1979)
can be used to address measurement problems related to writing 
assessment. 

These two approaches are not adequate for a variety of
reasons. First, they are based on raw scores that are non-linear
representations of a writing ability variable, and do not directly
lead to scales that have equal units. Second, the unit of analysis
for these two approaches is the raw score rather than the individual
rating. Recent advances in item response theory highlight the
advantages of using the item response directly, rather than
summarized as a raw score, as the unit of analysis for both
dichotomous and polytomous response data. Item response models can
be developed to directly model the probability of a student
obtaining a particular set of ratings based on an actual writing
sample. Although an empirical comparison of different approaches
to direct writing assessment would be interesting, it is difficult
to develop fair criteria for comparing these models because these
approaches possess many of the characteristics of different
paradigms (Kuhn, 1970) or research traditions (Laudan, 1977).

Several Rasch-based approaches for modelling essay ratings 
have also been proposed. Andrich proposed a Poisson Process model
based on the number of flaws observed in an essay (Andrich, 1973;
Hake, 1986). The Partial Credit model (Masters, 1982) has been
used to examine writing data (Ferrara & Walker-Bartnick, 1989;
Harris, Laan & Mossenson, 1988; Pollitt & Hutchinson, 1987). De
Gruijter (1984) proposed two models (one additive and the other 
nonlinear) for rater effects; the nonlinear model is based on the 
pairwise Rasch model of Choppin (1982) . 

Although each of these Rasch-based models offers significant 
advantages over earlier approaches to writing assessment, they are 
all essentially two-facet models (writing ability and rater
severity), and cannot adequately model assessment procedures that
are designed to have multiple facets. A recent extension of the
Rasch model proposed by Linacre (1989) and presented here
provides for multiple facets that can be calibrated simultaneously
but examined separately. For example, the four facets defined in
this study are writing ability, rater severity, writing-task 
difficulty, and domain difficulty. 

In summary, an assessment framework based on extensions of 
item response theory seems to offer a promising approach to the
measurement of writing ability. The Many-Faceted Rasch (FACETS)
model addresses many of the measurement problems encountered with
other approaches to writing assessment. Rasch measurement models
can provide a framework for obtaining objective and fair
measurements of writing ability which are statistically invariant
over raters, writing tasks and other aspects of the writing
assessment process. A FACETS model for the direct assessment of
writing ability is described in the next section based on the
current procedures used in Georgia for the Basic Skills Writing
Test (BSWT). Georgia's procedures can serve as a prototype for
other statewide assessments of writing. Next, a small data set is 
analyzed in order to illustrate the FACETS model. Finally, the 
implications of the FACETS model for theory, research and practice 
within the context of the statewide assessment of writing ability
are summarized.

A Measurement Model for the Assessment of Writing Ability

The measurement model underlying the writing assessment 
program used in Georgia is presented graphically in Figure 1. 



Insert Figure 1 about here 

The dependent variable in the model is the observed rating which 
ranges from 0 to 3 (0 = inadequate, 1 = minimal, 2 = good, 3 = very
good) . The four major facets that influence this rating are 
writing ability, rater severity, the difficulty of the writing task 
and domain difficulty. The structure of the rating scale which 
defines the categories also affects the value of the rating 
obtained. Other statewide assessments of writing would require
different forms of the FACETS model; for example, if holistic 
scoring is used, then the domain facet would not be necessary. 

Although not explicitly included in the measurement model, 
other student characteristics that reflect potential sources of 
bias may affect the observed rating of a student. Some examples of
these student characteristics are gender, age, ethnicity, social 
class and opportunity to learn. The biasing effects of these 
student characteristics can be examined after the facets are 
calibrated. Studies of Differential Facet Functioning (DFF) can be 
conducted by a variety of procedures that are conceptually similar 
to current approaches for studying differential item functioning 
(Engelhard, Anderson, & Gabrielson, 1990). For example, the
individual facets of the model for the assessment of writing
ability could be calibrated separately for females and males, and
the correspondence between these estimates examined to detect DFF.
Interactions between the facets can also be examined as a potential
source of bias in the assessment of writing ability. The
measurement model can also be elaborated in order to examine
hypotheses about why raters differ in severity, and also why 
writing tasks differ in difficulty. 
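
One simple way to examine the correspondence between two separately calibrated sets of facet estimates is a standardized difference. The sketch below is only an illustration: the severity values, the standard errors and the screening threshold of 2 are hypothetical, and the statistic itself is an assumption rather than the procedure used by the cited authors. It also assumes both calibrations have already been placed on a common origin.

```python
import math

def standardized_difference(est_a, se_a, est_b, se_b):
    """Difference between two independent calibrations in standard-error units."""
    return (est_a - est_b) / math.sqrt(se_a ** 2 + se_b ** 2)

# Hypothetical severity estimates (logits) for one rater, calibrated
# separately within two groups of students.
z = standardized_difference(0.40, 0.30, -0.20, 0.30)

# Illustrative screening rule (an assumption, not taken from the paper).
flagged = abs(z) > 2.0
```

A facet element would be set aside for closer review when `flagged` is true; with the hypothetical values above it is not.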

The Many-Faceted Rasch Model

The FACETS model is an extension of Rasch measurement models
(Rasch, 1980; Wright & Stone, 1979; Wright & Masters, 1982) that
can be used for writing assessments which include multiple facets,
such as raters and writing tasks. For the Georgia writing data
analyzed here, the Many-Faceted Rasch (FACETS) model can be written
as follows: 

\[
\log\left[\frac{P_{nijmk}}{P_{nijm(k-1)}}\right] = B_n - T_i - R_j - D_m - F_k
\]
where

$P_{nijmk}$ = probability of student n being rated k on writing task i
by rater j for domain m
$P_{nijm(k-1)}$ = probability of student n being rated k-1 on writing task
i by rater j for domain m
$B_n$ = Writing ability of student n
$T_i$ = Difficulty of writing task i
$R_j$ = Severity of rater j
$D_m$ = Difficulty of domain m
$F_k$ = Difficulty of rating Step k relative to Step k-1

The student facet, $B_n$, provides a measure of writing ability on a
linear logistic scale (logits) that ranges from minus infinity to
plus infinity. If the data fit the FACETS model, then these estimates
of writing ability are statistically invariant over raters and
writing tasks. These estimates of writing ability are invariant because
adjustments have been made for differences in rater severity and
the difficulty of the writing task. The writing-task facet, $T_i$,
calibrates the writing tasks on the same linear logistic scale, and
provides an estimate of the relative difficulty of each writing
task that is invariant over students and raters. Estimates of
rater severity, $R_j$, are also obtained on the same linear logistic
scale, and these are invariant over students and writing tasks.
Finally, invariant calibrations of the domain facet, $D_m$, and rating
scale step difficulties, $F_k$, are also obtained. The FACETS model
is an additive linear model based on this logistic transformation
to a logit scale.
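
The adjacent-category form of the model implies a probability for every rating category. The following is a minimal sketch of that calculation for a four-category scale, assuming the rating scale structure with common steps; the function name and all parameter values are hypothetical and chosen only for illustration.

```python
import math

def category_probabilities(b, t, r, d, steps):
    """Probabilities of ratings 0..len(steps) under a rating-scale
    many-facet Rasch model.

    b: student ability, t: task difficulty, r: rater severity,
    d: domain difficulty, steps: step difficulties F_1..F_K (logits).
    """
    eta = b - t - r - d
    # Category k has log-numerator sum_{h<=k} (eta - F_h); category 0 is 0.
    log_numerators = [0.0]
    for f in steps:
        log_numerators.append(log_numerators[-1] + eta - f)
    top = max(log_numerators)  # subtract the maximum for numerical stability
    weights = [math.exp(v - top) for v in log_numerators]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical parameter values, in logits.
steps = [-2.0, 0.0, 2.0]
probs = category_probabilities(b=1.0, t=0.3, r=-0.5, d=0.2, steps=steps)
```

By construction, the log of the ratio of adjacent probabilities in `probs` equals ability minus task difficulty, rater severity, domain difficulty and the corresponding step difficulty, which is the relationship stated in the model equation.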

Empirical Example

Subjects 

Fifteen students were randomly selected from the Spring 1989 
administration of the Basic Skills Writing Test (BSWT) that is
administered to all of the eighth-grade students in Georgia. Seven
of the students are female and eight are male; six of the students
are black and nine are white.

Instrument

The BSWT is a criterion-referenced test designed to provide a
direct assessment of student writing ability. Students are asked 
to write an essay of no more than two pages on an assigned writing 
task. The writing tasks are randomly assigned to the students. 
Each of the essays is rated by two raters on the following five 
domains: content/organization, style, sentence formation, usage and 
mechanics. A four-category rating scale is used for each domain
(0=inadequate, 1=minimal, 2=good and 3=very good). The final
response pattern used to estimate student writing ability consists
of ten ratings (two raters x five domains = ten ratings).
Additional information on the BSWT is available in the Teacher's 
Guide (Georgia Department of Education, 1990) . 

The raters are highly trained and a variety of procedures are 
used to maintain the reliability and validity of the ratings. 
First, the raters must successfully complete an extensive training 
program; this program typically takes three days. Next, the raters 
go through a qualifying process in order to become an operational
rater. During the qualifying process, each rater rates 20 essays
and these ratings are compared with a set of standard ratings
assigned by a validity committee of writing experts. Raters with
at least 62 percent exact agreement with the standard on the
ratings and 38 percent adjacent category agreement can become
operational raters.
Finally, two ongoing quality control procedures are used to
monitor the raters during the actual process of rating student 
essays. First, validity papers with a set of standard ratings are 
included in each packet of 24 essays and rater agreement is 
examined continuously; the raters are not able to identify the 
validity paper. Second, each essay is rated by two raters, and if 
a large discrepancy is found, then the essay is re-scored by a 
third rater. Further details of the training procedures and the 
ongoing quality control processes are available in the Training 
Manual (Georgia Department of Education, 1989) . 
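
The exact and adjacent-category agreement rates used in the qualifying process can be computed as in the sketch below. The ratings are hypothetical, and the way the program actually applies its thresholds is not specified here, so the function only reports the two proportions.

```python
def agreement_rates(rater, standard):
    """Return (exact, adjacent) agreement proportions for two rating lists."""
    if len(rater) != len(standard) or not rater:
        raise ValueError("rating lists must be non-empty and of equal length")
    n = len(rater)
    exact = sum(1 for a, b in zip(rater, standard) if a == b) / n
    adjacent = sum(1 for a, b in zip(rater, standard) if abs(a - b) == 1) / n
    return exact, adjacent

# Hypothetical ratings on the 0-3 scale for ten judgments.
candidate = [2, 1, 1, 0, 3, 2, 2, 1, 1, 2]
standard = [2, 1, 2, 0, 3, 2, 1, 1, 1, 3]
exact, adjacent = agreement_rates(candidate, standard)
```

Here seven of the ten hypothetical ratings match the standard exactly and the other three differ by one category.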

Although the full rhetorical specification of the writing 
tasks cannot be revealed because this is a high-stakes test, the
theme statements for the tasks examined here are "where you would
go if you won an all expense paid trip" (Task 72) and "time you
were successful" (Task 63). The mode of discourse for both of
these tasks is narration. 



The FACETS computer program (Linacre, 1988) was used to
analyze the data. A measurement model with four facets (writing
ability, rater severity, writing task difficulty and domain
difficulty) was estimated for the data. The rating scale model
with common step sizes across domains was used for the structure of
the rating scale. The program calculates several fit statistics 
that provide evidence regarding the validity of the FACETS model. 
The standardized fit statistic reported here is based on a
transformation of the unweighted mean square residuals to an
approximate t distribution (Wright & Masters, 1982). This
standardized fit statistic is sometimes referred to as the outfit
statistic because it is sensitive to outlying deviations from the
expected values. The standardized fit statistics are rounded to
the nearest integer by the FACETS program. For the purposes of
this study, obtained values for standardized fit statistics that
are less than 2 are interpreted as indicating acceptable fit to the
FACETS model. In addition to the standardized fit statistic, a
reliability coefficient which is similar to KR-20 (ratio of true
score variance to observed score variance) is reported for each
facet. Additional details regarding the computational and
statistical aspects of the FACETS model are presented in Linacre
(1989).
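
A reliability coefficient of this kind can be approximated from the calibrations and their standard errors. The sketch below treats "true" variance as observed variance minus the mean squared standard error and uses the population form of the variance; both choices are assumptions, and the FACETS program may compute the coefficient differently. The inputs are the rater severities and standard errors reported in the Results section.

```python
def separation_reliability(measures, standard_errors):
    """(observed variance - mean error variance) / observed variance."""
    n = len(measures)
    mean = sum(measures) / n
    # Population form of the variance (an assumption of this sketch).
    observed_var = sum((m - mean) ** 2 for m in measures) / n
    error_var = sum(se ** 2 for se in standard_errors) / n
    return (observed_var - error_var) / observed_var

# Rater severities (logits) and standard errors for Raters 197, 117 and 232.
rater_measures = [1.58, -0.57, -1.00]
rater_ses = [0.53, 0.30, 0.26]
rater_reliability = separation_reliability(rater_measures, rater_ses)
```

With these inputs the sketch gives a value close to the .88 reported for the rater facet.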

Results 

The observed ratings for the 15 students are presented in
Table 1. For this example, two writing tasks (63 & 72)



Insert Table 1 about here 

appeared and were rated by three raters (117, 197 and 232) . 

The calibration of the raters, writing tasks and domains on 
the linear logistic scale are shown in Figure 2. Task 72 is harder 



Insert Figure 2 about here

with a difficulty of .34 logits (SE = .27) as compared to Task 63
with a difficulty of -.34 logits (SE = .26). The reliability for
the writing tasks is .42, p = .06, which suggests that the
difference in the difficulties of these two writing tasks is close 
enough to the traditional critical value (p < .05) to be considered 
statistically significant. Both standardized fit statistics were 
less than 2; the obtained value for Task 72 is 0 and the value for 
Task 63 is -1. 

Rater 197 (R197 = 1.58, SE = .53) is more severe than the
other two raters (R117 = -.57, SE = .30; R232 = -1.00, SE = .26).
The reliability coefficient for the raters is .88, p < .01 which 
indicates that there is significant variation among the raters 
beyond the variation due to estimation error. This significant 
variation in the raters appears in spite of the extensive training 
and screening of the raters. The standardized fit statistics 
indicate that intra-rater consistency is acceptable with observed 
values of 1, 0 and -1 for raters 197, 117 and 232 respectively. 

Turning to the five domains, the order of difficulty from 
hard to easy on the logistic scale is as follows: usage (D1 = .64,
SE = .4*), style (D2 = .30, SE = .42), sentence formation (D3 =
-.03, SE = .41), mechanics (D4 = -.37, SE = .41) and
content/organization (D5 = -.54, SE = .42). The reliability is
.03, p = .25, which suggests that there are not statistically
significant differences in the relative difficulties of these five
domains. None of the standardized fit statistics is greater than
2; the observed values are 0 for four of the domains and -1
for sentence formation.

The calibrations of the steps within the rating scale from 0
to 3, with standard errors in parentheses, are as follows: -6.57
(.4C), .32 (.23) and 6.25 (.48). The observed proportions for the 
four categories from 0 to 3 are .09, .48, .37 and .06. This 
indicates that categories 1 and 2 are the most frequently used by 
these raters. 

Raw scores are calculated by summing the ten ratings for each
student. The raw scores range from 0 to a maximum of 30 (two
raters x five domains x maximum rating of 3 for each domain). The
operational version of the BSWT includes differential weights for
each domain, but this weighting is not used in the present example.
Observed raw scores ranged from 5 to 25 (M = 13.9, SD = 5.9). The
Rasch estimates of writing ability for these 15 students are
presented in Table 1, and these values ranged from -6.26 to 6.32
logits (M = -1.08, SD = 3.68). The reliability coefficient is
quite high for the student ability estimates (REL = .96, p < .01) . 
The correlation between the raw scores and the Rasch estimates is 
high, r(13) = .98, p < .01. This high correlation does not, 
however, eliminate the possibility that some raw scores are biased 
by variation in rater severity and writing task difficulty. 

In order to illustrate the consequences of not adjusting raw 
scores for rater and writing task effects, the ratings for two
students with the same raw scores of 8 are presented in Table 2.

Insert Table 2 about here 

These students have identical rating patterns, and yet the writing
ability of Student 12 (B12 = -4.12) is estimated to be about 2.1 logits
greater than that of Student 4 (B4 = -6.26). This difference in estimated
writing ability is observed because Student 12 happened to be rated 
by Rater 197 who is more severe than Rater 117. 
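
The direction of this adjustment can be illustrated by estimating ability for a fixed raw score while holding the other facets at their reported calibrations. The sketch below is an assumption-laden illustration, not the FACETS estimation procedure: it fixes the task, rater, domain and step values at the rounded figures given in the text, solves for the ability whose expected raw score equals the observed one, and assumes that Student 12 responded to Task 72 (inferred from the layout of Table 1). It is not expected to reproduce the tabled estimates exactly.

```python
import math

# Calibrations in logits as reported in the Results section.
STEPS = [-6.57, 0.32, 6.25]
RATER = {117: -0.57, 197: 1.58, 232: -1.00}
TASK = {63: -0.34, 72: 0.34}
# Domain order: content/organization, style, sentence formation, usage, mechanics.
DOMAIN = [-0.54, 0.30, -0.03, 0.64, -0.37]

def expected_rating(eta):
    """Model-expected rating on the 0-3 scale for a given logit eta."""
    log_numerators = [0.0]
    for step in STEPS:
        log_numerators.append(log_numerators[-1] + eta - step)
    top = max(log_numerators)
    weights = [math.exp(v - top) for v in log_numerators]
    return sum(k * w for k, w in enumerate(weights)) / sum(weights)

def estimate_ability(raw_score, task, raters, lo=-15.0, hi=15.0):
    """Bisection for the ability whose expected raw score equals raw_score."""
    def expected_total(b):
        return sum(expected_rating(b - TASK[task] - RATER[r] - d)
                   for r in raters for d in DOMAIN)
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if expected_total(mid) < raw_score:
            lo = mid   # expected score too low: ability must be higher
        else:
            hi = mid
    return (lo + hi) / 2.0

# Both students have a raw score of 8 but met different raters.
b4 = estimate_ability(8, task=63, raters=(117, 232))
b12 = estimate_ability(8, task=72, raters=(197, 232))
```

Because the expected raw score increases with ability, the student who met the more severe rater on the harder task needs a higher ability to reach the same raw score, so `b12` exceeds `b4`.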

The fit statistics for the Rasch ability estimates are
presented in Table 1. The observed values of the standardized fit
statistic show acceptable fit of the data to the model for all of
the students except Student 8. Student 8 has an observed fit
statistic of 2 and a detailed residual analysis for this student is
presented in Table 2. For comparison purposes, the rating patterns
for Students 4 and 12, who both have consistent ratings with
standardized fit statistics close to zero, are also presented in
Table 1. Fit statistics less than 2 indicate a close
correspondence between the observed and expected ratings. For
Student 4, only one of the standardized residuals is greater than
twice its standard error, while none of the standardized residuals
are significant for Student 12. Student 8 has three unexpectedly
high ratings from Rater 197 in style, usage and mechanics. This
essay should be examined in detail to determine whether or not
there is anything unusual about it, such as illegible handwriting,
an off-topic essay or a controversial response. 
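
A residual analysis of this kind rests on the expected rating and its model variance. The sketch below computes standardized residuals and their unweighted mean square for a set of ratings; the step values and observations are hypothetical, and the further transformation to the standardized (approximate t) statistic reported by the program is not shown.

```python
import math

def expected_and_variance(eta, steps):
    """Expected rating and model variance for logit eta under a rating scale model."""
    log_numerators = [0.0]
    for f in steps:
        log_numerators.append(log_numerators[-1] + eta - f)
    top = max(log_numerators)
    weights = [math.exp(v - top) for v in log_numerators]
    total = sum(weights)
    probs = [w / total for w in weights]
    expected = sum(k * p for k, p in enumerate(probs))
    variance = sum((k - expected) ** 2 * p for k, p in enumerate(probs))
    return expected, variance

def standardized_residual(observed, eta, steps):
    """(observed - expected) divided by the model standard deviation."""
    expected, variance = expected_and_variance(eta, steps)
    return (observed - expected) / math.sqrt(variance)

def outfit_mean_square(observations, steps):
    """Unweighted mean of squared standardized residuals.

    observations: list of (observed rating, eta) pairs, where eta is
    ability minus the task, rater and domain calibrations.
    """
    squared = [standardized_residual(x, eta, steps) ** 2 for x, eta in observations]
    return sum(squared) / len(squared)

# Hypothetical step difficulties and observations.
steps = [-2.0, 0.0, 2.0]
observations = [(2, 0.0), (1, 0.0), (3, 0.0), (2, 0.5), (1, -0.5)]
mean_square = outfit_mean_square(observations, steps)
```

Ratings far from their expected values contribute large squared residuals, which is why this unweighted statistic is sensitive to outlying ratings.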

In order to illustrate the consequences of not adjusting for
differences in writing task difficulty, unadjusted estimates of 
writing ability were calculated. These are presented in Table 3. 



Insert Table 3 about here 

The data suggest that if students were asked to respond to Task 
63, then their writing abilities would on the average be
overestimated by .34 logits; if they were asked to respond to Task 72,
then their writing abilities would be underestimated by -.38
logits. This is due to the differences in writing task difficulty 
with Task 72 being relatively more difficult than Task 63. 
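
The Task 63 figure can be checked against the values listed in Table 3. The sketch below assumes that Students 1 to 7 responded to Task 63, as the positive differences in the table suggest, and takes the adjusted and unadjusted estimates as printed there.

```python
# (adjusted, unadjusted) estimates in logits for Students 1-7, from Table 3.
task63 = [(-0.40, -0.05), (-2.99, -2.66), (5.13, 5.61), (-6.26, -6.09),
          (-1.24, -0.90), (0.05, 0.40), (-1.24, -0.90)]

# Unadjusted minus adjusted: positive values mean ability is overestimated.
differences = [round(unadjusted - adjusted, 2) for adjusted, unadjusted in task63]
mean_difference = sum(differences) / len(differences)
```

The individual differences match the Difference column of Table 3, and their mean is about .34 logits, the size of the Task 63 calibration with the sign reversed.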

A similar analysis was conducted for the influence of raters, 
and these results are presented in Table 4. Students who were rated 



Insert Table 4 about here 

by Raters 117 and 232 tend to have their writing abilities
overestimated (M = .70, SD = .37), while students who were rated by
Raters 197 and 232 tend to have their writing abilities
underestimated (M = -.40, SD = .37). This effect is due to the large
difference in rater severity between Rater 197 (R197 = 1.58), who
tends to be more severe, and Rater 117 (R117 = -.57), who tends to
be more lenient.



When adjustments for differences in both writing task
difficulty and rater severity are not made, then the average
difference between the adjusted and unadjusted estimates of
writing ability is .45 logits (SD = .71). These results are shown



Insert Table 5 about here 

in Table 5. Since the effects of writing task difficulty and rater
severity are additive, some of the students have their writing
abilities overestimated by more than 1.00 logit (Students 1 to 7)
if the unadjusted estimates are used. 

Discussion 

When the measurement of writing ability is based directly on 
student essays, there are many factors in addition to writing 
ability that can contribute to variability in the observed essay 
scores. Some of the major factors are differences in (1) rater 
severity (Lunz, Wright, & Linacre, 1990), (2) writing task
difficulty (Ruth & Murphy, 1988) , (3) domain difficulty when 
analytic scoring is used, (4) examinee characteristics other than 
ability (Brown, 1986) and (5) the structure of the rating scale. 
Ideally, the estimate of an individual's writing ability should be 
independent of the particular raters, writing tasks, and domains 
that happen to be used. Further, examinee characteristics apart
from writing ability, such as gender, race, ethnicity and social
class, should not influence the validity of the estimates of
writing ability. 

The Many-Faceted Rasch (FACETS) model described by Linacre 
(1989) provides a coherent framework for obtaining estimates of 
writing ability that are invariant over raters, writing tasks and 
domains. Issues related to bias can also be explored with the 
FACETS model. The FACETS model provides a framework for obtaining 
objective linear measurements of writing ability that generalize 
beyond the specific raters and writing tasks that happen to be used 
to obtain the observed rating. The FACETS model can also be 
applied to the assessment of writing ability based on holistic
scoring procedures (Cooper, 1977) . The structure of the rating 
scale can also be modelled using a Partial Credit model rather than 
the rating scale structure used here. 

The FACETS model provides the following advantages over other 
measurement models that have been used within the context of 
writing assessments:

1. The FACETS model is a scaling model based on a linear logistic 
transformation of the observed scores. The estimates of writing 
ability are specified to be on an equal-interval scale, in contrast
to the ordinal scale underlying the raw score based approaches 
based on analysis of variance models or linear structural equation 
models. 

2. The FACETS model provides an explicit approach for examining 
the multiple facets encountered in the design of most writing
assessments. A sound theoretical framework is provided for
adjusting for differences in raters and writing tasks. Adjustments
for rater severity and writing-task difficulty improve the
objectivity and fairness of the measurement of writing ability
because unadjusted scores lead to under- or overestimates of
writing ability when students are rated by different raters on
different writing tasks.

3. The FACETS model is a Rasch measurement model, and possesses
desirable statistical and psychometric properties related to the
separability of parameters with sufficient statistics available for 
estimating these parameters. 

4. If the data fit the FACETS model, then invariant estimates of 
writing ability, rater severity and writing task difficulty can be 
obtained which generalize beyond the specifics of the local writing 
assessment procedures. Tests of fit and residual analyses are
available to examine whether or not the data fit the FACETS model,
and whether these desirable invariance properties have been achieved.

5. The creation of rater and writing-task banks is
straightforward, and can be viewed as a simple extension of current item
banking procedures. When the data fit the FACETS model, the
creation of rater and writing task banks becomes simply a matter of
adding and subtracting the appropriate linking constants. Once the
banks are created, then the equating of the ratings for the
influences of raters and writing tasks is straightforward. These
banks, however, must be continually maintained and validated.

6. Incomplete research designs with missing cells and other forms
of missing data can be handled routinely, if attention is paid to
the construction of a connected network of links within and between
facets. Misfitting observations can be identified for diagnostic 
purposes and corrective actions taken when needed. 

7. Differential facet functioning (DFF) can be examined within 
different groups (gender, race and social class) in order to 
examine bias issues. This can be accomplished by calibrating the 
facets separately within relevant groups, and examining whether or 
not the relative difficulties of the components of the facet are
invariant over groups. Interactions between facets can also be 
examined as a potential source of bias in the assessment of writing 
ability. 
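
The linking step mentioned in point 5 can be sketched as a mean shift computed from the elements that two calibrations have in common. The rater labels and severity values below are hypothetical, and a mean shift over common raters is only one possible linking rule, assumed here for illustration.

```python
def linking_constant(bank, new_calibration):
    """Mean shift that places new_calibration on the bank's scale,
    computed over the elements common to both."""
    common = sorted(set(bank) & set(new_calibration))
    if not common:
        raise ValueError("no common elements: the calibrations are not linked")
    return sum(bank[k] - new_calibration[k] for k in common) / len(common)

# Hypothetical rater severities (logits): an existing bank and a new calibration.
bank = {"A": 0.50, "B": -0.30, "C": 0.10}
new = {"B": -0.10, "C": 0.30, "D": 0.90}

shift = linking_constant(bank, new)
# Add the constant to every new severity to express it on the bank's scale.
linked = {k: round(v + shift, 2) for k, v in new.items()}
```

In this example the common raters B and C both sit .20 logits higher in the new calibration, so rater D enters the bank at .70 rather than .90.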

In summary, the FACETS model offers a promising approach for
solving a variety of measurement problems encountered in the
statewide assessment of writing ability. The small example
presented here was intended to illustrate the FACETS model, and not
intended to provide a definitive examination of its usefulness for
solving these measurement problems. Additional research based on
operational forms with large writing samples is needed to further
examine the FACETS model within the context of the statewide
assessment of writing ability. This research should address in
detail the problems encountered in the development of calibrated
rater banks using the FACETS model. Further research on the use of
the FACETS model to address measurement problems encountered in the
development of operational writing-task banks for a statewide
assessment of writing ability is also needed. And finally, 
research is needed on differential facet functioning related to 
gender, race and social class; this research will contribute to our 
knowledge regarding the use of the FACETS model to examine 
potential sources of bias in statewide writing assessments. 
Table 1 

            Rater 117    Rater 197    Rater 232     Raw    Rasch 
Student     1 2 3 4 5    1 2 3 4 5    1 2 3 4 5    Score  Ability   SE   FIT 

Task 63 
   1        2 2 2 1 2                 2 1 2 1 1      16     -.40   .65    0 
   2        1 1 1 1 1                 2 1 1 1 1      11    -2.99   .97    0 
   3        2 2 3 2 3                 3 2 3 2 3      25     5.13   .65   -1 
   4        0 0 1 1 1                 1 1 1 1 1       8    -6.26   .79    0 
   5        2 1 1 1 1                 2 2 1 1 2      14    -1.24   .86   -1 
   6        2 1 1 2 1                 2 2 2 2 2      17      .05   .69    0 
   7        2 2 1 1 1                 2 2 1 1 1      14    -1.24   .66    0 

Task 72 
   8                     2 3 2 3 3    2 2 2 2 3      24     6.32   .75    2 
   9        2 2 2 1 2                 2 2 2 1 2      18     1.25   .76    0 
  10                     0 0 0 0 0    1 1 1 1 1       5    -5.94   .76   -1 
  11        2 1 2 2 2                 2 1 1 1 2      16      .28   .65    0 
  12                     0 0 1 1 1    1 1 1 1 1       8    -4.12   .82    0 
  13        1 2 1 2 2                 2 1 1 1 1      14     -.56   .66    1 
  14                     0 0 0 0 0    1 1 1 1 1       5    -5.94   .76   -1 
  15        1 1 2 1 1                 1 2 2 2 1      14     -.56   .66    0 

Note. Each student is rated by two raters using a four-category 
scale (0=inadequate, 1=minimal, 2=good, 3=very good) on each 
domain. The five domains are (1) content/organization, 
(2) style, (3) sentence formation, (4) usage and (5) 
mechanics. SE is the standard error of the Rasch estimate 
of writing ability, and FIT is the standardized fit 
statistic (FIT values less than 2 indicate that the rating 
pattern fits the model). 
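
The raw score in Table 1 is the sum of a student's ten ratings (two raters by five domains), and it is the statistic on which the Rasch ability estimate rests. The following is a minimal Python sketch of that arithmetic, not part of the original analysis. The names `ratings`, `reported_raw` and `raw_score` are illustrative, and the fourth Rater 197 rating for student 14 is read as 0, which is the value the reported raw score of 5 implies.

```python
# Ratings from Table 1: first five values are the first rater's domain
# ratings, last five are Rater 232's. Student 14's fourth rating is
# assumed to be 0 (the value implied by the reported raw score of 5).
ratings = {
    1: [2, 2, 2, 1, 2, 2, 1, 2, 1, 1],
    2: [1, 1, 1, 1, 1, 2, 1, 1, 1, 1],
    3: [2, 2, 3, 2, 3, 3, 2, 3, 2, 3],
    4: [0, 0, 1, 1, 1, 1, 1, 1, 1, 1],
    5: [2, 1, 1, 1, 1, 2, 2, 1, 1, 2],
    6: [2, 1, 1, 2, 1, 2, 2, 2, 2, 2],
    7: [2, 2, 1, 1, 1, 2, 2, 1, 1, 1],
    8: [2, 3, 2, 3, 3, 2, 2, 2, 2, 3],
    9: [2, 2, 2, 1, 2, 2, 2, 2, 1, 2],
    10: [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
    11: [2, 1, 2, 2, 2, 2, 1, 1, 1, 2],
    12: [0, 0, 1, 1, 1, 1, 1, 1, 1, 1],
    13: [1, 2, 1, 2, 2, 2, 1, 1, 1, 1],
    14: [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
    15: [1, 1, 2, 1, 1, 1, 2, 2, 2, 1],
}

# Raw scores as reported in Table 1.
reported_raw = {1: 16, 2: 11, 3: 25, 4: 8, 5: 14, 6: 17, 7: 14, 8: 24,
                9: 18, 10: 5, 11: 16, 12: 8, 13: 14, 14: 5, 15: 14}


def raw_score(student):
    """Sum of the ten ratings a student received."""
    return sum(ratings[student])


# Students whose summed ratings disagree with the reported raw score.
mismatches = [s for s in ratings if raw_score(s) != reported_raw[s]]
print(mismatches)
```

Students with the same raw score can still receive different ability estimates (for example students 4 and 12), because the estimates also adjust for the raters and the writing task involved.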



Table 2 

[Observed ratings, expected ratings, and residuals by rater and domain 
for students 4, 12, and 8. Most cell values are illegible in the source; 
the legible entries follow.] 

                                        Raw     Rasch 
Student   Observed ratings              Score   Ability   FIT 

   4                                             -6.26 
  12                                             -4.12 
   8      2 3 2 3 3   2 2 2 2 3           24      6.32      2 

Note. Residuals are the differences between observed and expected 
ratings. Asterisks indicate residuals that are more than two standard 
errors. The five domains are (1) content/organization, (2) style, 
(3) sentence formation, (4) usage and (5) mechanics. 
Table 3 

Comparison of writing ability estimates adjusted and unadjusted for 
differences in writing task difficulty 

            Adjusted    Unadjusted         Difference 
Student     Estimate    Estimate       Task 63    Task 72 

   1          -.40        -.05           .35 
   2         -2.99       -2.66           .33 
   3          5.13        5.61           .48 
   4         -6.26       -6.09           .17 
   5         -1.24        -.90           .34 
   6           .05         .40           .35 
   7         -1.24        -.90           .34 
   8          6.32        6.14                      -.18 
   9          1.25         .93                      -.32 
  10         -5.94       -6.46                      -.52 
  11           .28        -.05                      -.33 
  12         -4.12       -4.59                      -.47 
  13          -.56        -.90                      -.34 
  14         -5.94       -6.46                      -.52 
  15          -.56        -.90                      -.34 

Mean         -1.08       -1.12           .34        -.38 
SD            3.68        3.81           .09         .11 

Note. The adjusted ability estimates are the same as the Rasch 
estimates of writing ability reported in Table 1. 
Differences are based on unadjusted minus adjusted estimates of 
writing ability. Negative values indicate underestimates, while 
positive values indicate overestimates of writing ability. 
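
The difference column and its task means can be reproduced directly from the two estimate columns. The sketch below is illustrative only. It assumes that students 1 to 7 wrote on Task 63 and students 8 to 15 on Task 72, which is the grouping that reproduces the reported column means; `rows` and `mean_difference` are names introduced here.

```python
from statistics import mean

# (student, task, adjusted estimate, unadjusted estimate) from Table 3.
# The task assignment is an assumption inferred from the column means.
rows = [
    (1, 63, -0.40, -0.05), (2, 63, -2.99, -2.66), (3, 63, 5.13, 5.61),
    (4, 63, -6.26, -6.09), (5, 63, -1.24, -0.90), (6, 63, 0.05, 0.40),
    (7, 63, -1.24, -0.90),
    (8, 72, 6.32, 6.14), (9, 72, 1.25, 0.93), (10, 72, -5.94, -6.46),
    (11, 72, 0.28, -0.05), (12, 72, -4.12, -4.59), (13, 72, -0.56, -0.90),
    (14, 72, -5.94, -6.46), (15, 72, -0.56, -0.90),
]


def mean_difference(task):
    """Mean of (unadjusted - adjusted) over the students who took `task`."""
    diffs = [unadj - adj for _, t, adj, unadj in rows if t == task]
    return round(mean(diffs), 2)


print(mean_difference(63), mean_difference(72))
```

A positive mean for Task 63 and a negative mean for Task 72 is the pattern expected when the easier task inflates unadjusted estimates and the harder task deflates them.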
Table 4 

Comparison of writing ability estimates adjusted and unadjusted for 
differences in rater severity 

            Adjusted    Unadjusted      Difference (raters) 
Student     Estimate    Estimate       117,232    197,232 

   1          -.40         .27           .67 
   2         -2.99       -2.30           .69 
   3          5.13        6.15          1.02 
   4         -6.26       -5.62           .64 
   5         -1.24        -.56           .68 
   6           .05         .71           .66 
   7         -1.24        -.56           .68 
   8          6.32        6.38                       .06 
   9          1.25        1.92           .67 
  10         -5.94       -6.33                      -.39 
  11           .28         .94           .66 
  12         -4.12       -4.97                      -.85 
  13          -.56         .11           .67 
  14         -5.94       -6.33                      -.39 
  15          -.56         .11           .67 

Mean         -1.08        -.67           .70        -.40 
SD            3.68        3.96           .37         .37 

Note. The adjusted ability estimates are the same as the Rasch 
estimates of writing ability reported in Table 1. 
Differences are based on unadjusted minus adjusted estimates of 
writing ability. Negative values indicate underestimates, while 
positive values indicate overestimates of writing ability. 
Table 5 

Comparison of writing ability estimates adjusted and unadjusted for 
differences in rater and writing task difficulty 

            Adjusted    Unadjusted 
Student     Estimate    Estimate      Difference 

   1          -.40         .62           1.02 
   2         -2.99       -1.90           1.09 
   3          5.13        6.16           1.03 
   4         -6.26       -4.95           1.31 
   5         -1.24        -.20           1.04 
   6           .05        1.06           1.01 
   7         -1.24        -.20           1.04 
   8          6.32        5.74           -.58 
   9          1.25        1.58            .33 
  10         -5.94       -6.35           -.41 
  11           .28         .62            .34 
  12         -4.12       -4.95           -.83 
  13          -.56        -.20            .36 
  14         -5.94       -6.35           -.41 
  15          -.56        -.20            .36 

Mean         -1.08        -.63            .45 
SD            3.68        3.80            .71 

Note. The adjusted ability estimates are the same as the Rasch 
estimates of writing ability reported in Table 1. 
Differences are based on unadjusted minus adjusted estimates of 
writing ability. Negative values indicate underestimates, while 
positive values indicate overestimates of writing ability. 
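
The summary rows of Table 5 follow from the two estimate columns. As a minimal sketch (the variable names are illustrative, and the sample standard deviation with n - 1 in the denominator is assumed because it reproduces the tabled value), the mean and SD of the differences can be computed as follows.

```python
from statistics import mean, stdev

# Adjusted and unadjusted estimates for students 1-15, from Table 5.
adjusted = [-0.40, -2.99, 5.13, -6.26, -1.24, 0.05, -1.24, 6.32,
            1.25, -5.94, 0.28, -4.12, -0.56, -5.94, -0.56]
unadjusted = [0.62, -1.90, 6.16, -4.95, -0.20, 1.06, -0.20, 5.74,
              1.58, -6.35, 0.62, -4.95, -0.20, -6.35, -0.20]

# Difference = unadjusted minus adjusted, as stated in the table note.
differences = [u - a for u, a in zip(unadjusted, adjusted)]

mean_diff = round(mean(differences), 2)
sd_diff = round(stdev(differences), 2)  # sample SD (n - 1), an assumption
print(mean_diff, sd_diff)
```

The spread of the differences, rather than their mean, is what matters for individual students: ignoring both rater and task shifts some estimates by more than one logit in either direction.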
Figure 1 

Measurement model for the assessment of writing ability 

[Figure: diagram of the measurement model. Legible labels: Rater 
Severity; Difficulty of Writing Task; Structure of (label truncated 
in source).] 

* Potential bias factors that are not explicitly included in the 
measurement model. 
Figure 2 

Calibration of raters, writing tasks and domains 

         Writing Task          Raters                   Domains 
 2.0 +   Hard                  Severe                   Hard 
                               <-- Rater 197 (1.58) 
 1.5 + 
 1.0 + 
                                                        <-- Usage (.64) 
  .5 + 
         <-- Task 72 (.34)                              <-- Style (.30) 
 0.0 + 
                                                        <-- Sentence (-.03) 
         <-- Task 63 (-.34)                             <-- Mechanics (-.37) 
 -.5 + 
                                                        <-- Content (-.54) 
                               <-- Rater 117 (-.57) 
-1.0 +                         <-- Rater 232 (-1.00) 
-1.5 +   Easy                  Lenient                  Easy 
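
The calibrations in Figure 2 can be combined to show how rater severity, task difficulty and domain difficulty jointly shift the rating a student is expected to receive. The sketch below assumes the usual rating scale form of a many-facet Rasch model, in which the log-odds of category k over category k - 1 equals ability minus rater severity minus task difficulty minus domain difficulty minus a step threshold. The step thresholds in `THRESHOLDS` are hypothetical placeholders, since no step calibrations appear in the figure, and the function names are introduced here for illustration.

```python
import math

# Calibrations (in logits) as shown in Figure 2.
RATER = {117: -0.57, 197: 1.58, 232: -1.00}
TASK = {63: -0.34, 72: 0.34}
DOMAIN = {"content": -0.54, "style": 0.30, "sentence": -0.03,
          "usage": 0.64, "mechanics": -0.37}

# Hypothetical step thresholds for the 0-3 scale (not from the paper).
THRESHOLDS = [-3.0, 0.0, 3.0]


def category_probabilities(ability, rater, task, domain,
                           thresholds=THRESHOLDS):
    """Probabilities of ratings 0..3 under the assumed rating scale form."""
    location = ability - RATER[rater] - TASK[task] - DOMAIN[domain]
    # Log-numerator for category k is the sum of (location - threshold_j)
    # over the first k thresholds; category 0 has log-numerator 0.
    log_num = [0.0]
    for tau in thresholds:
        log_num.append(log_num[-1] + location - tau)
    peak = max(log_num)  # subtract the maximum for numerical stability
    weights = [math.exp(v - peak) for v in log_num]
    total = sum(weights)
    return [w / total for w in weights]


def expected_rating(ability, rater, task, domain):
    """Model-expected rating: sum of category value times probability."""
    probs = category_probabilities(ability, rater, task, domain)
    return sum(k * p for k, p in enumerate(probs))


# Same student, task and domain, rated by the most severe and most
# lenient raters in Figure 2.
severe = expected_rating(0.0, 197, 72, "usage")
lenient = expected_rating(0.0, 232, 72, "usage")
print(severe < lenient)
```

Under these assumptions a student of fixed ability receives a lower expected rating from Rater 197 than from Rater 232, which is the kind of rater effect the adjusted estimates are meant to remove.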



